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Abstract 

Background: To facilitate deciphering underlying transcriptional regulatory circuits in mouse embryonic stem (ES) 
cells, recent ChlP-seq data provided genome-wide binding locations of several key transcription factors (TFs); 
meanwhile, existing efforts profiled gene expression in ES cells and in their early differentiated state. It has been 
shown that the gene expression profiles are correlated with the binding of these TFs. However, it remains unclear 
whether other TFs, referred to as cofactors, participate the gene regulation by collaborating with the ChlP-seq TFs. 

Results: Based on our analyses of the ES gene expression profiles and binding sites of potential cofactors in 
vicinity of the ChlP-seq TF binding locations, we identified a list of co-binding features that show significantly 
different characteristics between different gene expression patterns (activated or repressed gene expression in ES 
cells) at a false discovery rate of 10%. Gene classification with a subset of the identified features achieved up to 
20% improvement over classification only based on the ChlP-seq TFs. More than 1/3 of reasoned regulatory roles 
of cofactor candidates involved in these features are supported by existing literatures. Finally, the predicted target 
genes of the majority candidates present expected expression change in another independent data set, which 
serves as a supplementary validation of these candidates. 

Conclusions: Our results revealed a list of combinatorial genomic features that are significantly associated with 
gene expression in ES cells, suggesting potential cofactors of the ChlP-seq TFs for gene regulation. 




Genomics 



Background 

A set of core transcription factors (TFs) have been 
reported to regulate the self-renewal and pluripotency of 
mouse embryonic stem (ES) cells. Oct4 has long been 
regarded as one of the master regulators in ES cells. 
Oct4-deficient embryos fail to produce pluripotent inner 
cell mass [1]. Furthermore, while repression of Oct4 
allows trophectoderm development, a less-than-twofold 
increase in Oct4 expression drives differentiation into 
primitive endoderm and mesoderm [2]. Together with 
Oct4, Sox2 explains the first three lineages present in 
preimplantation development; both factors are essential 
to epiblast formation, and in their absence trophec- 
toderm is formed [3]. Nanog, another master factor, can 
bypass leukemia inhibitory factor (LIF)/STAT3 to main- 
tain ES cell self-renewal [4,5], and Nanog-deficient ES 
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cells lose pluripotency and differentiate into extraem- 
bryonic endoderm lineage [5] . 

In addition to the three master TFs, implications of the 
regulatory roles of a few other TFs in mouse ES cells have 
been obtained via experimental efforts. LIF signal pathway 
can sustain self-renewal of the cells by activating STAT3 
[6]; BMPs collaborate with LIF for the maintenance of 
self-renewal via triggering the phosphorylation of Smadl 
to induce Id genes [7] . Myc and Klf4 are two of the four 
factors that can reprogram somatic cells to pluripotent 
cells [8]. Esrrb is required for efficient self-renewal of ES 
cells in vitro; it is required to block differentiation into 
mesoderm, ectoderm and neural crest cells [9]. Depletion 
of Zfx impairs self-renewal of ES cells while over-expres- 
sion of the factor can facilitate the self- renewal [10]. 

To reconstruct the regulatory network in mouse ES 
cells, genome-wide binding data of these important TFs 
have been generated by ChlP-seq/chip experiments 
[11-13]. In particular, Chen et al. [11] made available 
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ChlP-seq data of the following 12 TFs in mouse ES cells: 
Oct4, Sox2, Nanog, STAT3, Smadl, Myc, Klf4, Zfx, Esrrb, 
Mycn, Tcfcp2ll, and E2fl. We refer to them as main fac- 
tors. The authors showed that the binding of these factors 
is correlated with retinoic-acid-induced gene expression 
profiles [9]. Furthermore, Ouyang et al. [14] built statistical 
models upon the ChlP-seq data, which can explain sub- 
stantial variation in gene expression in mouse ES cells 
[15,16]. However, the following problem remains to be 
investigated: Whether other TFs that bind together (or co- 
bind) with main factors, referred to as cofactors, can 
further explain expression patterns? In this study, with the 
ChlP-seq data in [11] and the expression data in [16], we 
attempt to answer this question by exploring association 
between the gene expression and binding sites of potential 
cofactors on genomic islands co-occupied by groups of 
main factors. We intend to evaluate whether the associa- 
tion is different between different expression patterns, and 
according to the evaluation, we recommend a small set of 
cofactors for future follow-up experimental validation. To 
this end, we perform the following analyses (Figure 1) and 
report their results in this article. 

1. Clustering analysis of the ChlP-seq data and the 
gene expression profiles revealed that genes sharing 



similar binding patterns of the main factors may still 
have distinctive expression patterns, leading to our 
hypothesis of existence of cofactors. The /r-means 
clustering was employed in this step. 

2. We then constructed features characterizing co- 
binding effects of main factors and a cofactor candi- 
date on their potential target genes. In order to inte- 
grate genomic information of cofactor candidates with 
the main factor binding data, we scanned genome 
regions defined by the ChlP-seq coordinates with a set 
of known TF motifs to identify co-localization of bind- 
ing sites of main factors and a cofactor candidate in 
the genome; next, we computed a feature score for a 
gene and a feature-a combination of (co-localized) 
main factors and a cofactor candidate-such a score 
intends to quantify association strength between a 
gene and a co-binding combination, giving a numerical 
summary of the genomic sequence information and 
the ChlP-seq data for the gene. 

3. Through hypothesis tests, we identified features 
which have significantly different distributions 
between ES-up genes and ES-down genes represent- 
ing genes up- and down-regulated in ES cells, respec- 
tively (see Results for their definitions). By checking 
expression profiles of a cofactor candidate involved in 
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Figure 1 The diagram of the analysis procedure. The main steps are listed on the left, and their corresponding results/purposes are indicated 
on the right. 
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a feature, we reasoned its regulatory role based on 
test results. Two-sample proportion test and the Wil- 
coxon rank-sum test were used, with the false discov- 
ery rate (FDR) controlled by the Benjamini-Hochberg 
procedure [17]. 

4. Regarding ES-up and ES-down genes as two classes 
of genes and significant features identified in the last 
step as predictors, we adopted gene classification with 
naive Bayes classifiers to further choose a small subset 
of features, which suggest cofactor candidates through 
involved TF combinations. 

5. Finally, we predicted target genes of TFs involved in 
selected features by perturbing learned classifiers and 
examining change in classification. We reported con- 
sistent evidence supporting our prediction from 
another independent data set. The purpose of this 
practice is to provide supplementary validation of 
cofactor candidates revealed by selected features via 
the target prediction. 

For brevity, we refer to cofactor candidates as 
cofactors. 

Results 

Hypothesis of existence of cofactors 

We first define gene sets of our interest and introduce 
some existing findings about the main factors based on 
the ChlP-seq data, and then present their interplay results 
to motivate our hypothesis of existence of cofactors and 
further analyses. 
ES-up and ES-down genes 

Zhou et al. [16] produced gene expression profiles at ES- 
cell stage and early differentiated stage, including 3 profiles 
of undifferentiated ES cells (with high Oct4 expression), 
5 profiles of 2-, 4-, 8-day embryoid bodies with high Oct4 
expression, and 8 profiles of 2-, 4-, 8-, 15-day embryoid 
bodies with low Oct4 expression. Regarding high Oct4 
expression as a marker for ES cells, we refer to the 8 pro- 
files with high Oct4 expression as samples at ES stage and 
the remaining profiles as samples at DF (differentiation) 
stage. Two gene sets have been identified: The genes 
whose expression is higher at ES stage than at DF stage 
with a fold change > 2 and P-value< 0.05 in a two-sample 
comparison, and the genes that have significantly lower 
expression at ES stage comparing with expression at DF 
stage under the same cutoffs. We refer to the first set of 
genes as ES-up genes and the second set of genes as ES- 
down genes. We call these two sets of genes collectively ES 
genes. 

The Myc group and the Oct4 group of the main factors 

In the ChlP-seq data [11], a pair of start and end coordi- 
nates specifies a peak location or a binding site of a TF in 
coarse resolutions, and the number of reads (after nor- 
malization) associated with a peak indicates the intensity 
of binding signals on the site. The authors showed that 



Oct4, Sox2, Nanog, Smadl, and STAT3 have different 
binding patterns from Myc and Mycn. Ouyang et al. [14] 
reported a similar phenomenon. Specifically, they first 
defined an association score to measure association 
strength between a gene and a main factor (see Meth- 
ods). Intuitively, the closer the binding sites (peaks) of a 
main factor to the transcription start site (TSS) of the 
gene, and the stronger the ChlP-seq signal intensities 
(the number of reads) on the sites, the higher the associa- 
tion score between the gene and the main factor. 
Through principal components analysis of the association 
scores of all the main factors, they identified two groups 
of TFs: The Myc group consists of E2fl, Zfx, Mycn, and 
Myc; the Oct4 group has Oct4, Sox2, Nanog, Smadl, 
STAT3, Tcfcp211, and Esrrb. While the Myc group is 
mainly associated with up-regulated genes in ES cells, the 
Oct4 group may either activate or repress genes at ES 
stage according to their regression analysis of the associa- 
tion scores and expression data. 
Gene clusters based on association scores 
Having learned the differences between these two TF 
groups, we intended to examine whether genes can be rea- 
sonably partitioned according to the association scores. 
This examination was to check whether a similar binding 
pattern shared by a set of genes would lead to similar 
expression of the genes. We applied the /r-means cluster- 
ing to genes based on their association scores with the 12 
main factors (reported in [14]), and chose k = 5 as we may 
expect four clusters of genes associated with the binding 
of none/either/both of the two groups, leaving an addi- 
tional cluster accommodating any unexpected binding pat- 
tern. Figure 2 shows the mean association scores for the 
five clusters and the grand mean association scores over 
all the genes (note that the grand means of different TFs 
are all around 1.25 because of quantile normalization in 
[14]). We observed distinguished binding patterns among 
these five clusters. By comparing the mean association 
scores with the grand means, we see that cluster 1 includes 
genes highly associated with all the TFs; in contrast, clus- 
ter 5 consists of genes having weak association with all of 
them. Cluster 2 has genes more intensively associated with 
the Oct4 group than the Myc group; in cluster 3, conver- 
sely, the Myc group shows stronger association than the 
Oct4 group; similar to cluster 2, cluster 4 contains genes 
for which the Oct4 group owns higher association scores 
than the Myc group, although the association is relatively 
moderate comparing with the association for cluster 2. For 
future reference, we name the five clusters according to 
the observed binding patterns: The uniformly -high cluster 
for cluster 1, the uniformly-low cluster for cluster 5, the 
Oct4 cluster for cluster 2, the Myc cluster for cluster 3, and 
the Oct4-moderate cluster for cluster 4. We excluded the 
uniformly-low cluster from further investigation because 
we were interested in finding cofactors working with the 
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main factors in ES cells and all the main factors are weakly 
associated with the genes in that cluster. 

Figure 3 shows the association scores of ES genes in 
the remaining four clusters and their corresponding 
expression profiles (see Additional file 1 for visualization 
of two components of association scores: ChlP-seq 
intensities and distances to TSS). Although the genes in 
a cluster share a similar binding pattern, a substantial 
number of these genes are either up- or down-regulated 
at ES stage. For example, the Myc cluster has around 
the same number of ES-up genes as ES-down genes. We 
hypothesize that cofactor binding may help account for 
the two distinct expression patterns in the same binding 
cluster. For the Oct4 cluster and the Oct4-moderate 
cluster, we hope to find cofactors collaborating with the 



members of the Oct4 group-note that the Oct4 group 
has stronger association than the Myc group in these 
two clusters-and thus, the TFs in the Oct4 group are 
called the main factors of interest for these two clusters; 
analogously, for the Myc cluster, the members of the 
Myc group are the main factors of interest; for the uni- 
formly-high cluster, all the 12 TFs are the main factors 
of interest. 

Meanwhile, we focus on ES genes in these four clus- 
ters (see Additional file 2). Our goal is to search for 
cofactors whose co-binding (with corresponding main 
factors of interest) to ES-up genes and to ES-down 
genes in a cluster are significantly different from each 
other, and can optimally classify the genes within the 
cluster. 
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Myc group 0ct4 group Cellular conditions 



(a) (b) 

Figure 3 Association scores and expression profiles with genes aligned between (a) and (b) (a) Association scores of the ES genes that 
are in the uniformly-high cluster, the Oct4 cluster, the Myc cluster, and the Oct4-moderate cluster, with cyan-colored horizontal lines indicating 
boundaries between the clusters, (b) Gene expression profiles with the first 8 conditions corresponding to ES stage and the rest corresponding 
to DF stage, annotated by the number of ES-up (indicated by "up") genes and ES-down ("down") genes in each cluster. Note that genes within a 
cluster are sorted according to their expression fold change. Association scores and gene expression profiles are the same as the ones generated 
in their original papers. Specifically, the association scores are log-transformed and quantile-normalized [14]; the gene expression profiles in [16] 
are normalized by the invariant set method implemented by dChip [37]. For display purpose, both the association scores and the gene 
expression profiles are rescaled to the range between -2 and 2. 



Features and feature scores 

A feature is defined as association between genes and a 
combination of w main factors of interest and one cofac- 
tor. Such a combination is referred to as a TF combina- 
tion. We consider three types of features corresponding 
to w = 2, 1 and 0: A pair of main factors and a cofactor, a 
single main factor and a cofactor, and a cofactor alone. 



Whereas a cofactor prefers to co-bind to DNA sequences 
with two or one specific main factor for the first two 
types of features, respectively, the last type does not indi- 
cate such preferences. Depending on main factors of 
interest in a cluster, features involving TF combinations 
constructed in different clusters are different. For exam- 
ple, for the Myc cluster, a TF combination consists of 
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one or two of the main factors of the Myc group (E2fl, 
Mycn, Zfx, and Myc) and one cofactor. Thus, different 
clusters have different numbers of features. Given 202 
pre-selected potential cofactors (see Methods for selec- 
tion criteria and Additional file 3 for pre-selected cofac- 
tors), there are 13332 (= (\ 2 ) x 202) features with w = 2 
and 2424 features with w = 1 for the uniformly-high clus- 
ter. For the Oct4 and Oct4-moderate clusters, there are 
4242 and 1414 features corresponding to w = 2 and w = 
1, respectively. For the Myc cluster, there are 1212 fea- 
tures with w = 2 and 808 features with w = 1. Finally, 
there are 202 features involving only cofactors (w = 0) for 
all the clusters. 

To locate places where a co-binding of a TF combina- 
tion may happen, we identified genomic regions that 
contain at least two binding sites of main factors of 
interest in the ChlP-seq data. Such a genomic region is 
called a neighborhood. Please see Methods for neighbor- 
hood construction and Additional files 4, 5, 6 for neigh- 
borhood summary statistics such as their lengths and 
distances to TSS's. As mentioned above, for different 
clusters, features are different depending on different 
main factors of interest, and thus different subsets of 
the ChlP-seq data were used for neighborhood con- 
struction. For example, the ChlP-seq data of E2fl, 
Mycn, Zfx, and Myc were used for the Myc cluster. We 
then scanned the neighborhoods with the 202 pre- 
selected TF motifs from TRANSFAC 12.1 [18] and de 
novo motif discovery to detect possible binding sites of 
these potential cofactors (see Methods). 

Next, a neighborhood is associated with a gene if its 
center is within d bps of the gene's TSS (See Methods for 
how to specify d). Given a gene and its associated neigh- 
borhoods containing binding sites of main factors and a 
cofactor involved in a TF combination, we computed a 
feature score to quantify the association strength between 
the TF combination and the gene (see Methods). Intui- 
tively, a greater feature score is due to a shorter distance 
from the location of the TF combination to the TSS and/ 
or a stronger binding strength of the combination. 

If a TF combination cannot be associated with a gene 
because it does not occur in any associated neighbor- 
hood of the gene, we say that the feature score between 
this combination and the gene does not exist (see Addi- 
tional files 7, 8, 9, 10, 11, 12 for all feature scores). 

Feature significance 

For each feature, we employed two-sample comparisons 
to test whether the feature score distribution of ES-up 
genes is significantly different from the distribution of 
ES-down genes. Specifically, the Wilcoxon rank-sum test 
was utilized to test whether the feature scores in one 
gene set are significantly greater than the feature scores 
in the other (due to non-normality of feature score 



distributions, the nonparametric test was employed); 
two-sample proportion test was used to test whether the 
proportion of the genes associated with a TF combina- 
tion (involved in the feature) in one gene set is signifi- 
cantly greater than that in the other. 

For each type of features, many features lead to a multi- 
ple testing problem for each kind of test. (Note that a fea- 
ture type is defined by w = 2, 1, 0, the number of main 
factors in a TF combination as presented in the last sec- 
tion. For example, features involving two main factors are 
one type of features.) We collected significant features 
under an FDR cutoff of 10%. We define an ES-up feature 
as a feature that shows a higher proportion in ES-up genes 
than that in ES-down genes or has significantly higher 
scores in ES-up genes than ES-down genes. An ES-down 
feature is defined in an analogous way. Table 1 sum- 
marizes the number of significant features in each cluster 
(see Additional file 13 for significant features). For the 
Oct4 cluster, there are only a few significant features with 
w = 0 but many more significant features with w = 1 or 2. 
For the Myc cluster and the Oct4-moderate cluster, there 
are also more significant features involving a TF combina- 
tion than features involving only a cofactor. The result 
demonstrates that we would have ignored important fea- 
tures if we only considered cofactors alone, and that the 
constructed features are effective to capture a cofactor's 
preference for co-binding with main factors for potential 
gene regulation. 

Interestingly, almost all significant ES-up features were 
discovered by the rank-sum test. This fact may suggest 
activation of genes at ES stage is more likely to happen 
when a TF combination is located closer to TSS's, and/or 
it has a stronger binding strength. In contrast, all signifi- 
cant ES-down features were identified by two-sample 
proportion test. This phenomenon indicates that the 
occurrence of a particular TF combination may be suffi- 
cient to increase the chance of down-regulation at ES 
stage-neither its location nor its strength makes differ- 
ences. Since utilizing features in continuous and binary 
fashions informs us different meanings, both are useful 
and should be considered together. 

We can reason regulatory roles of a cofactor involved 
in a significant feature by checking its fold change in 



Table 1 Numbers of ES-up (U) and ES-down (D) features 
(FDR < 10%) in different clusters 





w = 2 




w = 1 




w = 0 






U 


D 


U 


D 


u 


D 


uniformly-high cluster 


2 


0 


4 


3 


1 


1 -1 


Oct4 cluster 


7 


186 


22 


10 


8 


2 


Myc cluster 


3 


76 


7 




0 


63 


Oct4-moderate cluster 


-16-1 


0 


366 


0 


102 


0 



Note: w denotes the number of main factors, indicating feature types. 



Chen and Zhou BMC Genomics 201 1, 12:515 
http://www.biomedcentral.eom/1 471 -2 1 64/1 2/515 



Page 7 of 18 



the expression profiles and checking whether the feature 
is an ES-up or ES-down feature. Fold change in the 
expression profiles is said to be positive if the ratio of 
the mean expression at ES stage over the mean expres- 
sion at DF stage > 2, and negative if the ratio < 0.5. By 
its definition, an ES-up feature is significantly associated 
with genes that are up-regulated at ES-stage (and down- 
regulated at DF stage). Therefore, if a cofactor involved 
in an ES-up feature shows positive fold change, it may 
play an activator role in ES cells. On the other hand, if 
the cofactor is active at DF stage, suggested by its nega- 
tive fold change, it may repress genes at DF stage. For a 
cofactor with neither positive fold change nor negative 
fold change, it may show uniformly high expression, 
defined by average expression index > 500 at both ES 
and DF stages, and thus may function at both stages. In 
this case, we may not tell its role as an activator at ES 
stage from an repressor at DF stage. Following the same 
logic, we can reason the roles of a cofactor involved in 
an ES-down feature as either a repressor at ES stage 
and/or an activator at DF stage. Table 2 summarizes the 
above reasoning. 

Feature selection 

Based on the significant results we obtained, we further 
selected a small subset of features with gene classification, 
where top features ranked by their significance (according 
to the tests described in the previous section) were treated 
as predictors in a naive Bayes (NB) classifier and ES-up 
and ES-down genes as two classes (see Methods). 

Since ES-up and ES-down features indicate different bio- 
logical meanings, we hoped to incorporate information 
from both ES-up and ES-down features into predictors. 
To reduce potential combinations of three types of ES-up 
features (corresponding to w = 2, 1, 0) and three types of 
ES-down features as indicated by six columns in Table 1, 
we restricted our search strategy to combining only one 
type of ES-up features and only one type of ES-down fea- 
tures, and thus there are a total of 9 (= 3 x 3) combina- 
tions to be examined. Specifically, we used top ki ES-up 
features (of one type) ranked by their P-values and top k 2 
ES-down features (of one type) as k (= ki +k 2 ) predictors 
in an NB classifier. Figure 4 illustrates the situation where 
we examined a combination of one type of ES-up features 
(w = 1) and one type of ES-down features (w = 2) based 
on ten-fold cross-validation (CV). It shows that the classi- 
fication accuracy for enumeration of the two types of 



Table 2 Reasoned regulatory roles of a cofactor 



Cofactor expression pattern 


ES-up feature 


ES-down feature 


Positive fold change 


ES activator (EA) 


ES repressor (ER) 


Negative fold change 


DF repressor (DR) 


DF activator (DA) 


Uniformly high 


EA and/or DR 


ER and/or DA 



features for the Oct4 cluster, where on average the combi- 
nation of top six ES-up features and top two ES-down fea- 
tures achieves the highest accuracy (minimum CV error). 
The decreasing trend of the accuracy can also be observed 
as ki and k 2 increase from six and two, respectively. Simi- 
lar enumeration was conducted for the remaining eight 
combinations of feature types, and the combination with 
the minimum CV error was selected. Please see Methods 
for details of the feature enumeration procedure we 
employed given two types of features. 

Table 3 compares NB classifiers based on only main 
factors (and thus using association scores) with those 
utilizing both main factors and selected features invol- 
ving cofactors. Although little improvement was made 
for the uniformly-high cluster and the Oct4-moderate 
cluster, incorporating cofactors can reduce 20% and 11% 
of CV errors compared to only using main factor effects 
for the Oct4 cluster and the Myc cluster, respectively. 
Ouyang et al. [14] adopted the CART algorithm on 
three principal components of association scores of 
main factors for gene classification. As a comparison, 
we also classified ES-up and ES-down genes with the 
CART algorithm based on the same principal compo- 
nents. This approach showed similar performance to the 
NB classifiers using only main factors as predictors 
(Table 3). Again, it was about 16% worse than the classi- 
fiers utilizing cofactor effects for the Oct4 cluster and 
the Myc cluster. 

There is little improvement by utilizing cofactor infor- 
mation in the uniformly-high cluster and the Oct4-mod- 
erate cluster. It may be due to more evident imbalance 
between ES-up and ES-down genes in each of the two 
clusters than the other two clusters-73% ES-up genes in 
the uniformly-high cluster and 72% ES-down genes in 
the Oct4-moderate cluster. The dominant gene sets may 
suggest that gene clustering based on the association 
scores (with only main factor information) already parti- 
tions the genes well for these two clusters so that the 
contribution from features involving cofactors is mar- 
ginal for further classification. 

Table 4 shows selected features based on the best CV 
classification results. We denote a feature by the concatena- 
tion of main factor names and the motif name of a cofactor 
with dot as separator. A motif name starting with M is 
from TRANSFAC, and with N from de novo discovery. For 
example, the feature Smadl.STAT3.M00052_NFKB 
involves the main factors Smadl and STAT3 and a cofactor 
whose motif name is M00052_NFKB in TRANSFAC. 
Figure 5 visualizes how this selected feature helps differenti- 
ate the ES-down and ES-up genes in the Oct4 cluster. The 
proportion of the ES-down genes associated with this fea- 
ture is much higher than that of the ES-up genes. There- 
fore, this features is defined as an ES-down feature. On 
the other hand, Figure 6 shows that the feature Oct4. 
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20 0 

Figure 4 Ten-fold CV accuracy with different numbers of ES-up and -down features Ten-fold CV accuracy with top /t, ES-up features 
involving w = 1 main factor and top k 2 ES-down features involving w = 2 main factors for the 0ct4 cluster, 0 < k h ki s 20. The highest CV 
accuracy 64% is labeled in the figure, involving top 6 ES-up features and top 2 ES-down features. 



Table 3 Ten-fold CV errors and corresponding standard errors 



Method 


uniformly-high cluster 


Oct4 cluster 


Myc cluster 


Oct4-moderate cluster 


NB_MF 


0.27 (0.02) 


0.45 (0.02) 


0.42 (0.02) 


0.29 (0.02) 


CART 


0.27 (0.02) 


0.43 (0.02) 


0.44 (0.02) 


0.28 (0.02) 


MB 


0.26 (0.02) 


0.36 (0.02) 


0.37 (0.02) 


0.29 (0.02) 


(NB_MF - NB)/NB_MF 


0.04 


0.20 


0.11 


0.00 


(CART - NB)/CART 


0.04 


0.16 


0.16 


-0.04 



Note: NB_MF-classification with NB classifiers based on only main factors; CART-classification with CART based on three principal components of association 
scores of main factors as in [14]; NB-classification with NB classifiers utilizing both main factors and cofactors. 
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Table 4 Summary of selected features 
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Note: UH cluster is shorthand for the uniformly-high cluster; Oct4-M cluster is shorthand for the 0ct4-moderate cluster. The column "ES U/D" indicates whether a 
feature is an ES-up feature (U) or an ES-down feature (D). Locuslinks, cofactor gene names, and fold change (FC) of gene expression (N-negative fold change, P- 
positive fold change, H-uniformly high) of cofactors are listed correspondingly. The column "LS" indicates related literature support for a feature. 



M00801_CREB_Q3 has higher feature scores for the ES-up 
gene set than the ES-down gene set, thus making this fea- 
ture an ES-up feature. 

Although features involving two main factors and one 
cofactor are not selected for other clusters, the two ES- 
down features for the Oct4 cluster are of this type. In addi- 
tion, except the Oct4-moderate cluster, there are features 
involving one main factor and one cofactor in all the other 
clusters. Not only these features show strong statistical 



significance (with FDR < 10%), but also they are useful 
from the classification perspective. The evidence from 
these aspects suggests that the cofactors involved in these 
features are likely to be regulators in ES cells. 

Literature support 

For more than one third of the features in Table 4, 
existing literatures provide supporting evidence of 
related biological functions. We list relevant literatures 
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ES stage DF stage 



(a) (b) 

Figure 5 For the ES-down feature Smadl. STAT3.M00052_NFKB, association and expression profiles (rescaled for display purpose) with 
genes aligned between (a) and (b) in the 0ct4 cluster, with the same order in Figure 3. (a) Binary association of the feature with the ES 
genes, with value 2 indicating existence of association and value -2 indicating no association, (b) Gene expression profiles with the first 8 
conditions corresponding to ES stage and the rest corresponding to DF stage. Similar to Figure 3, the gene expression profiles are normalized 
by the invariant set method and rescaled to [-2, 2]. 



for specific features in the last column of the table. Here 
we present support for some features in detail. Reported 
in the Oct4 cluster, Rela, which binds to M00052_NFKB 
sites, was inferred to co-bind with STAT3 and Smadl 
for activating genes at DF stage. Lee et al. [19] suggested 
that Rela, p300, and STAT3 are in the same DNA-bind- 
ing complex in tumors. In addition, they provided 
related references showing that Rela and STAT3 stimu- 
late a highly overlapping repertoire of prosurvival, prolif- 
erative, and proangiogenic genes. The above findings 
show that the collaboration exists between STAT3 and 
Rela, and our analyses further reveal that there is a need 
for these two TFs to work with Smadl in order to regu- 
late genes in ES cells (note that the feature STAT3. 



M00052 NFKB only shows marginal significance 
with P-value = 0.06). The feature Smadl. STAT3. 
M00415_AREB6_04 involves Zebl, which binds to 
M00415_AREB6_04 sites, and Smadl, which is a recep- 
tor-regulated Smad (R-Smad) and a key component of 
the BMP signaling pathway [7]. Consistently, Postigo 
et al. [20] found that Zebl synergizes with Smad- 
mediated transcriptional activation and regulates BMP 
signaling, and that R-Smad and Zebl form a complex 
that recruits p300 much more efficiently, thus account- 
ing for their transcriptional synergy. These results sup- 
port Smadl and M00415_AREB6_04 in the identified 
regulatory code, but STAT3 is necessary for them to 
receive special attention for ES cell studies-the 
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Figure 6 For the ES-up feature Oct4.M00801_CREB_Q3, feature scores and expression profiles (rescaled for display purpose) with 
genes aligned between (a) and (b) in the 0ct4 cluster, with the same order in Figure 3. (a) Feature scores for the ES genes (b) Gene 
expression profiles with the first 8 conditions corresponding to ES stage and the rest corresponding to DF stage. 



significance of the feature Smadl.M00415_AREB6_04 
(with P-value = 0.02) is not as strong as the reported 
three-TF feature (with P-value = 1.51 x 10' 6 ). 

According to Gata6's inferred role from the feature 
Nanog.M00462_GATA6_01-a repressor at DF stage, it 
may repress genes activated by Nanog at ES stage. This 
exemplifies the situation where a main factor and a 
cofactor in an identified TF combination may not always 
work with each other at the same time. The above rea- 
soning is in line with one of the findings in [21], which 
demonstrated that antagonism between Nanog and 
Gata6 leads to segregation of epiblast and primitive 
endoderm within inner cell mass, and that an excess of 
Gata6 pushes the cell into the endoderm lineage. 

We now discuss some studies on cofactors identified 
for the Myc cluster. Consistent with the inferred role of 



Otx2 as a repressor at ES stage, the results in [22] 
revealed that Otx2 regulates neuronal progenitor 
domains by repressing Nkx2.2 in the ventral midbrain. 
In another study [23], Puelles et al. suggested that Otx2 
represses GABAergic differentiation to control glutama- 
tergic progenitors of the thalamus. Another predicted 
repressor at ES stage, Trp53, induces differentiation of 
mouse ES cells via suppressing Nanog expression [24]. 
We reasoned that Pparg activates genes at DF stage. 
Indeed, it is one of the peroxisome proliferator-activated 
receptors (PPARs), a group of three nuclear receptor 
isoforms interacting with other factors to increase tran- 
scription initiation rate [25]. 

Although Klf4 is dispensable for maintenance of self- 
renewal and pluripotency of ES cells, concurrent deple- 
tion of Klf2, Klf4, and Klf5 leads to ES cell differentiation 
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[26]. Hall et al. [27] showed that Oct4 mainly induces 
Klf2 and LIF/STAT3 selectively enhances Klf4 expres- 
sion. Our study revealed that the binding sites of these 
Klfs are preferentially enriched in the neighborhoods 
associated with the ES-down genes in the Myc cluster, 
suggesting possible cooperation between the Oct4 group 
and the Myc group via Klfs on gene repression in ES 
cells-Oct4/STAT3 may introduce Klfs that cooperate 
with the members of the Myc group. 

In tune with Rbpj's reasoned role as an activator from 
the feature M01111_RBPJK_Q4 discovered in the Oct4- 
moderate cluster, the intracellular part of the cell surface 
Notchl receptor (Notchl-IC) alters the function of 
RbpjK to be a transcription activator [28]. Additionally, 
Robert-Moreno et al. [29] showed that activation of 
Gata2 expression by Notchl/Rbpj/t is essential for the 
onset of definitive hematopoiesis in the mouse embryo. 
Our analysis suggested that Zic3 may be a transcriptional 
regulator at ES stage. Lim et al. [30] established that 
repression of Zic3 in ES cells induces expression of sev- 
eral markers of the endodermal lineage and leads to sig- 
nificant reduction of Nanog expression, and thus Zic3 
plays an important role in the maintenance of pluripo- 
tency by preventing endodermal lineage specification in 
ES cells. Zic3 may activate repressors of the endoderm 
markers and thus works as an activator in ES cells. Given 
the above evidence from the existing literatures, other 
identified cofactors may also play some unknown roles at 
ES/DF stages, and thus may be worth follow-up experi- 
mental verification. 

Support by target prediction 

Based on the formulation of an NB classifier, we pre- 
dicted target genes that are regulated by TFs involved in 
a feature. The prediction is through checking the change 
in the probability ratio for classification after a feature is 
excluded from predictors of a classifier that employs the 
selected features (see Methods). To provide evidence for 
predicted targets, we checked gene expression profiles of 
identified cofactors and their predicted targets in another 
data set-the RAi (retinoic-acid-induced) data [9]. Gene 
expression was profiled for 6 days when ES cells under- 
went retinoic-acid-induced differentiation. We treated 
day 0 as ES stage and day 4-6 as DF stage, and computed 
fold change of the expression profile of day 0 over the 
mean expression of day 4-6 as in [16]. We then checked 
whether fold change of an identified cofactor in the RAi 
data matches its fold change in our study. Besides exact 
match, we allowed uniformly high status in one data set 
to match positive/negative fold change in the other 
because the status indicates potential functioning of a 
cofactor at both stages. 

If a match is found, expression change of predicted tar- 
gets is checked for validation: According to our analysis, 



it is expected that expression of targets of an activator at 
ES stage and/or a repressor at DF stage (in terms of a 
cofactor's regulatory role) should decrease after differen- 
tiation, and conversely, expression of targets of a repres- 
sor at ES stage and/or an activator at DF stage should 
increase. The proportion of targets whose observed 
expression satisfied the above expectation is high, ranging 
from 61% to 92%, as shown by Pc in Table 5, where 
selected quantiles of targets' fold change for each feature 
are also presented for reference. Based on this consistent 
result, the target-prediction practice further supports the 
identified cofactors and their regulatory roles. 

Among all the predicted targets, an interesting category 
is the ones that themselves are main factors or identified 
cofactors, indicating cascade regulatory pathways. 
Noticeably, multiple targets in this category are observed 
for each of the three combinations selected in the uni- 
formly-high cluster (Table 4): Otx2, Klf2, Tcfcp211, 
Pou5fl, and Trp53 are potential targets of E2fl and 
Mef2a; Otx2, Pou5fl, Trp53, and Sox2 are possibly regu- 
lated by Sox2 and Smad4 or Smad2; Klf2, Tcfcp2ll, 
Pou5fl, Trp53, and Sox2 are among the predicted targets 
of Esrrb and Elfl or Elf2 or Gabpa. Other clusters do not 
show this phenomenon. It implies that genes in the uni- 
formly-high cluster (that is, genes highly associated with 
all the 12 main factors) may play important roles in 
forming complex network structure and should receive 
particular attention when modeling regulatory networks 
is the goal of a study. Consistently, the three related top 
GO terms enriched in the targets of this cluster are: 
Negative regulation of biological process, regulation of 
developmental process, and regulation of cell 
differentiation. 

We summarize biological implications of our results as 
follows: Based on the definitions of ES-up/-down fea- 
tures, we inferred the regulatory roles of the potential 
cofactors involved in the reported features (Table 4), 
some of which are supported by relevant literatures; the 
gene-classification practice shows that the identified TF 
combinations may explain the gene expression; the 
expected expression change of the predicted targets in 
another data set adds more confidence in the inferred 
regulators' function. The results of this study can provide 
new clues to expand the core regulatory network among 
main factors and to identify novel combinatorial regula- 
tion of the rich expression profiles in mouse ES cells. 

Discussion 

Although the classification improvement of our 
approach for the Oct4 cluster and the Myc cluster com- 
paring to the other methods in Table 3 is substantial 
and encouraging, the absolute CV errors are quite high. 
Such results could be due to the following reasons. 
First, as shown in Table 1, there are more significant 
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Table 5 Summary of expression fold change of predicted targets in the RAi data 
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Note: FC_RAi is fold change of cofactor expression. A/ b is the number of predicted targets for a feature. N e is the number of targets having expression available in 
[9]. P c is the proportion of N e genes with expected expression change. 1Q, Median, and 3Q are, respectively, the 1st quartile, the median, and the 3rd quartile of 
targets' fold change, which is inverted for ES-down features for convenient comparison. 



features than what has been enumerated under our 
search strategy. Some of them are biologically meaning- 
ful in explaining gene expression, and thus have the 
potential to further reduce classification errors. For 
example, the neuronal repressor REST is involved in the 
ES-down feature Oct4.Smadl.M00325_NRSE_B (with P- 
value = 2.07 x 10- 5 ) for the Oct4 cluster, which suggests 
REST collaborating with the key TFs Oct4 and Smadl 
for gene down-regulation at ES stage. In line with its 
repressor role, this cofactor has been shown to maintain 
self-renewal and pluripotency in mouse ES cells through 
suppression of the microRNA miR-21 [31]. Thus, this 
ES-down feature may be worth further investigation. 
However, one challenging question is how features like 
this can be discovered; simply increasing the number of 
top features in NB classifiers may include many noisy 
features and thus degrade classification performance 



according to our experience. Therefore, procedures with 
stronger selective power are needed. Second, sophisti- 
cated learning methods such as boosting and Bayesian 
additive regression trees (BART) may have better classi- 
fication performance (as Zhou and Liu [32] demon- 
strated in TF-DNA binding problems), but these 
methods may have difficulties in interpretation of their 
results. In the framework we employed, not only do the 
statistical tests help reduce search space by focusing on 
significant features for feature selection in classification, 
but they also provide ground for biologically meaningful 
interpretations of features as we have discussed in the 
section of feature significance. Third, although in this 
study we focused on TF regulatory control, TF binding 
is only one way of gene regulation. Other mechanisms 
such as DNA methylation are also involved in regulating 
gene expression. Lack of their information may lead to 
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low classification accuracy. Thus, readers should treat 
our report in this study as intermediate results-further 
investigation of this challenging classification problem is 
needed. 

In the current framework, we consider features involving 
only one cofactor and up to two main factors. Further 
extension to more cofactors and main factors is possible, 
and the extension may gain additional capacity to differ- 
entiate the gene expression patterns. However, with more 
and more specific TF combinations, the support (that is, 
genes that are regulated by the involved TFs) may become 
less and less, and thus the gained capacity may not be 
detected. On the other hand, the extension will lead to 
combinatorial explosion. For example, including one more 
cofactor will bring -100 times more features into consid- 
eration. This will result in much more tests for each feature 
type, which may cause high FDR for the current data size. 
In addition, following the logic of combinatorial regulatory 
codes, another kind of features would be a pair of main fac- 
tors without considering a cofactor in our framework. 
Alternatively, one could treat one main factor as a cofactor 
of the other main factor. We explored this possibility and 
found that little improvement can be made in gene classifi- 
cation. This may be because classification happens within 
each gene cluster and collaborative efforts among main fac- 
tors have already been captured by clustering. 

Besides ES genes, other genes with less substantial fold 
change, referred to as ES-neutral genes, may also be infor- 
mative. If the function of a TF combination is only to acti- 
vate genes in ES cells, we expect that the contrast in the 
feature between ES-up and ES-down genes may be more 
significant than the contrast between ES-up genes and ES- 
neutral genes. The reason is that the TF combination may 
have a lower chance to randomly occupy regulatory ele- 
ments of an ES-down gene (and thus would up-regulate it 
at ES stage) than an ES-neutral gene, which may be acti- 
vated but has insignificant fold change. Similarly, focusing 
on ES genes may be helpful for detecting a TF combina- 
tion as a pure ES repressor. On the other hand, if a parti- 
cular combination works as either an activator or a 
repressor in ES cells depending on other cellular context 
or targets, a feature may then be strongly associated with 
both ES-up and ES-down genes, leading to insignificant 
test results and defeating detection. In this case, the ES- 
neutral gene set, containing genes that are not regulated 
by the combination, may provide a better contrast for 
detecting such a combination. The detection may be con- 
ducted in the same framework with one of ES gene sets 
replaced by ES-neutral genes. In summary, many possible 
directions can be explored based on this study. 

Conclusions 

This study suggests a list of TF combinations which may 
play important regulatory roles in ES cells based on 



computational analyses. They serve as top candidates for 
experimental evaluation. We provided computational 
evidence of the finding from three aspects: 1. The fea- 
tures involving identified TF combinations show strong 
statistical significance; 2. the classifiers based on them 
have the optimal performance in classifying gene expres- 
sion and also achieve substantial improvement over clas- 
sifiers utilizing only main factor information; 3. their 
predicted target genes (based on classification) in 
another independent data set show expected fold change 
as in our prediction. In addition to the above evidence, 
existing literatures provide support for reasoned regula- 
tory roles of some identified cofactors. In summary, this 
study effectively reveals combinatorial co-binding pat- 
terns which involve potential regulators in mouse ES 
cells. 

Methods 

Association scores 

Suppose that n binding sites (peaks) of a main factor are 
located within 10 s base pairs (bps) of a gene's TSS. The 
association score between the gene and the main factor 
is defined by 

di 

Y^fiido, (1) 

1=1 

where f\ denotes the ChlP-seq signal intensity on the 
i'th main factor binding site, d t represents the distance 
from the site to the gene's TSS, and d 0 is a weighting 
constant, controlling the speed of the exponential decay. 

Motif selection 

If a TF's expression has small fold change or is very low, 
the TF may not function at either ES or DF stage. 
Therefore, for motif scanning, we selected motifs whose 
corresponding TF's expression is either uniformly high 
or shows positive/negative fold change. Please refer to 
Additional file 3 for the list of the selected motifs and 
their expression fold change. 

Neighborhood construction 

With the following procedures, we constructed genomic 
regions where we searched for combinatorial TF binding 
sites for gene regulation. First, a genomic neighborhood 
of a peak (from the ChlP-seq data) is created via 
extending 500 base pairs (bps) from the peak center 
coordinate in both directions; then, two neighborhoods 
are merged into one neighborhood if a peak center in 
one neighborhood is located within 500 bps of a peak 
center in the other neighborhood; the merging operation 
is made recursively till no two neighborhoods can be 
merged. We focused on neighborhoods which include at 
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least two peaks. The above construction and considera- 
tion were inspired by [11] and [12] -they had discovered 
a good number of genomic locations co-occupied by 
multiple TFs based on ChlP-seq and bioChlP-chip 
experiments, respectively. The neighboring peaks indi- 
cate that these genomic islands may be potential regula- 
tory regions where co-binding is likely to happen. 

Binding-site scanning procedure 

We now describe how we computed motif score of a 
sequence segment of width w-a w-mer-and determined 
whether it is a binding site. Given a sequence 5 of 
length L, we first estimated v|//,, the transition matrix of 
a first-order Markov model as a background model. 
Given the position-specific weight matrix (PWM) for a 
motif v|/ m , we then calculated the following probability 
ratio r as the motif score for a w-mer located at the ith 
position of S for i = 1,..., L - w + 1: 

r = P{S[i,i + w-l]\V m ) 
P(S[i,i + w- l]|Wi,) ' 

We also scanned the complementary strand of 5 for 
computing motif scores. Note that motif scanning was 
applied after repeat sequences were masked out. 

To decide whether a motif score is significant, we ran- 
domly drew 20 matched control sequences from the 
mm8 genome using the CisGenome program [33] for a 
neighborhood sequence. For every control sequence, its 
length and distance from its center to its closest gene's 
TSS matched the length of the neighborhood sequence 
and the distance from the neighborhood center to the 
TSS of the gene closest to the neighborhood, respec- 
tively. The later match is shown to provide a compara- 
tive or better match to binding region GC-content than 
genome-wide controls [34]. Thus, it improves accuracy 
of binding site detection. We first scanned matched 
control sequences for a cluster with Oct4 PWM and 
built a null distribution for the motif scores of Oct4. 
We then found a cutoff P-value 9.09 x 10 s correspond- 
ing to r = 1000, which is reasonable according to our 
past experience. We then collected significant sites as 
Oct4 binding sites where corresponding w-mers' scores 
are greater than 1000. For other motifs, we first used 
P-value 9.09 x 10" to find score cutoffs from null distri- 
butions and then recorded binding sites according to the 
cutoffs. 

Computing feature scores 

Let k index the associated neighborhoods of a gene. 
Suppose that n k and m k are the number of main factor 
binding sites (peaks) and the number of binding sites of 
a cofactor in the kth. neighborhood, respectively. The 
feature score is defined by 



i i — - — fo=\ 4i + Efei hi) 

J2l(l\M nk (Uski) ]e d ° , 

h i=l j=l 

where f ki denotes the ChlP-seq signal intensity on the 
i'th main factor binding site, g k j denotes the motif score 
of the y'th cofactor binding site, and d ki and d k j represent 
the distances to the gene's TSS from the two sites, 
respectively. Similar to the distance limit for allowing a 
binding site to be associated with a gene in [14], the 
longest distance for associating a neighborhood with a 
gene, d, is taken to be 10 6 -that is, neighborhoods that 
are 1 million bps away from the TSS are not considered, 
and d 0 = 5000 is a constant for controlling the exponen- 
tial decay. Feature scores are computed and used in nat- 
ural-log scale. 

We can understand a summation term, which corre- 
sponds to contribution from the kth neighborhood, in 
(3) by parts: The part in the brackets is a combined 
binding strength based on respective geometric 
averages-one component from main factors and the 
other from a cofactor; it is then exponentially down- 
weighted according to the average distance from binding 
sites to the TSS. 

The feature score definition is an extension of the 
definition of association scores (1). Since an association 
score is defined for only one main factor with a gene, it 
concerns only the main factor's ChlP-seq signal intensi- 
ties and the distances from their binding sites to the 
TSS. The study [14] showed that in predicting gene 
expression association scores are superior to traditional 
scores based on binary association of binding sites with 
a gene, which do not utilize ChlP-seq signal intensities 
and relative distances between sites and a gene. In our 
study, we extend association scores to feature scores for 
quantifying the association between a TF combination 
and a gene. To avoid possible confusion about the con- 
cepts of association scores and feature scores, Figure 7 
summarizes their differences and relationship. 

NB classifier and feature enumeration 

Suppose there are k predictors in total, denoted by X 
with Xj being the jth predictor for / = 1,..., k. Let Y 
denote class label: Y = 1 indicates that a gene is an ES- 
up gene; Y = 0 indicates that a gene is an ES-down 
gene. Let T represent related parameters. An NB classi- 
fier is defined by 

P(Y= l|X,vIQ P{Y=l)Y\UnXj\*ji) 
P(Y=1|X,*) = P(y = 0)n^P(^l*;o)' 

The classifier compares probabilities that a gene 
belongs to a class given its predictors, and assigns a 
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i 
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Extended to 



Association strength between 
a gene and a main factor 



Feature score 


I 


i 

Equivalent 



Association strength between 
a gene and a TF combination of 
w main factors and a cofactor 



Figure 7 Concepts of association scores and feature scores and their relationship. The concepts are briefly defined in the bottom two 
boxes. 



class label y = argmax ^ (li0 } P(Y = y | X, *P) to a gene. 
In the equation, P(Y) denotes the prior probabilities of 
the two classes, and P(Xj | *P jy) is the conditional den- 
sity of the y'th predictor given Y (= 1 or 0). The probabil- 
ity of a gene belonging to a class given predictors X is 
computed under the assumption that predictors are 
conditionally independent of one another given a class 
label. 

In Equation (4), the prior probabilities of the two 
classes P(Y) are p\ and 1- pi for Y = 1 and 0, respec- 
tively. In the density of the jth predictor P(Xj \ f jY ), Xj 
is a random variable denoting (1) whether the feature 
score of the jth predictor exists and (2) the feature score 
if it exists (note that a feature score may not exist for a 
gene if the feature cannot be associated with the gene, 
as explained at the end of the section of Feature and 
feature scores): Let the indicator variable L = 1 i£ the 
feature score of the jth predictor exists for a gene, and L 
= 0 otherwise; then if Ij = 1, P(Xj \ ¥ jY ) = X JY f jY (Xj), 
where Xj Y is the probability that a gene with class label 
Y has a feature score of the jth predictor, and fj Y is the 
density function of the jth predictor's feature score in 
class Y. On the other hand, if the feature of the jth pre- 
dictor cannot be associated with a gene, Xi is then 
reduced to the indicator variable Ij, and thus P{Xj | *P j Y ) 
= 1 - Xi Y given L = 0 in this case. To summarize the 
above two cases, P(X ; |* ;Y ) = (1 - kpr) 1 ' 1 * [IfffeiXj)) 11 ■ 

The parameters and density functions were estimated 

N, 

from a training data set as follows ft, = — , where N-i is 

N 

the number of ES-up genes, and N is the total number 

of ES genes. Xjy = — — , where Nj Y is the number of the 

My 

genes having X ; 's scores in class Y, and N Y is the num- 
ber of the genes in class Y. We adopted a kernel method 
to estimate the density function fj Y , 



where x i yi is the score of X> for the ith gene in class 
Y, K(-) is the standard normal density, and hj Y is the 
bandwidth of the kernel for Xj in class Y, calculated 
according to Silverman's rule of thumb (see page 48 of 
[35]). Since top features or predictors from two-sample 
proportion test are based on only binary information, 
the density function of feature scores fjyix) is excluded 
from the computation of P(Xi \ Wiy) for these predictors. 

Given one type of ES-up features and one type of ES- 
down features, we enumerated every possible combina- 
tion of /<i and k 2 (0 < ki, k 2 < 20) for the above NB clas- 
sifier formulation. We considered utilizing up to 20 top 
features due to the following two reasons: 1. Such an 
option covers 2/3 of the significant cases in Table 1, 
that is, the number of significant features (of one type) 
is less than 20 for 16 out of the 24 cases in the table; 2. 
we observed that CV errors in general increase consid- 
erably when /ci and/or k 2 become large (for example, 
Figure 4). Predictors involving main factors alone were 
also included (as control variables), with their associa- 
tion scores utilized in the same way as feature scores. 
When k\ = k 2 = 0, these are the only predictors. Mini- 
mum CV errors were used to determine the number of 
features /q and k 2 . 

Target prediction 

A gene is predicted to be regulated by a TF combination 
involved in an ES-up (-down) feature if this gene satis- 
fies the following two conditions: 1. It is an ES-up 
(-down) gene; 2. after the feature is excluded from pre- 
dictors of an NB classifier, we observe a decrease (an 
increase) in the ratio of the probability of the gene as an 
ES-up gene over that as an ES-down gene with the 
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remaining predictors (at the left-hand side of Equation 
(4)). The first condition ensures that target prediction 
be consistent with ES-up/-down feature definitions, 
which are based on statistical significance of one-sided 
test results; the operation in the second condition 
intends to mimic knockout of a TF combination and to 
check expected expression change of potential targets. 
Additional file 14 lists predicted targets ranked by fold 
change in the probability ratio. Readers can refer to 
Additional file 15 for different Gene Ontology (GO) 
terms, which were retrieved by the DAVID program 
[36], enriched in the targets of different clusters to learn 
about their biological functions. 

Additional material 
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